World Data League 2022

Notebook Submission Template

This notebook is one of the mandatory deliverables when you submit your solution. Its structure follows the WDL evaluation criteria and it has dedicated cells where you should add information. Make sure your code is readable as it will be the only technical support the jury will have to evaluate your work. Make sure to list all the datasets used besides the ones provided.

🎯 Challenge

Optimization of Soft-Mobility Drop-off Points

Theory

As described in the challenge's introduction, Porto suffers from heavy traffic, and e-scooters might help alleviate this problem. However, one first needs to understand who the scooter users are:

  1. People above 13-14 and below 40 years of age - mostly represented by school and university students, as well as young professionals.
  2. Tourists visiting Porto (with similar demographics to those above).

Since the core demographic of point 1 likely holds a subsidized monthly bus/tram ticket, bus/tram is probably their primary means of transportation. Our hypothesis is therefore that scooters are not a replacement for bus/tram but a complement. This hypothesis can be checked by determining the mean/median trip length, which we do in section EDA, subsection E-Scooter OD. See more details in Models 1 and 2 below.

Furthermore, we hypothesize that scooters are used either at the start of a trip - then followed by bus/metro - or the other way around.

Datasets

We have used the original Bus and Metro GTFS datasets, as well as the E-scooter ride and park datasets. As a complement, we have also included Census data from the Portuguese National Institute of Statistics and amenity data from Open Street Map.

Analysis

Assuming that the e-scooters are a complement to the current public transportation network, we decided to analyze which bus and metro stops are the busiest and to use these as hotspots (i.e., points that likely require a scooter park) in our model. We first attempted to use passenger ticket-validation data (metro and bus) from the third challenge. This, however, proved ineffective because the validation data is from 2020, a year strongly disrupted by the Covid-19 lockdowns, which particularly affected our target demographic - students and young professionals.

Therefore, as a measure of passenger volume, we instead used the average number of bus/metro rides through a given stop during a day, as detailed in section EDA, subsections Bus GTFS and Metro GTFS. This should be a reasonable proxy for volume, as the transportation company (STCP) likely assigns vehicle frequency proportionally to the number of passengers on a given route. However, this proxy fails if there are inefficiencies in the system - e.g., stops where the few passing buses are packed, and stops where the frequent buses run half-empty.

In our view, the scooters should be integrated with the current transportation network, but closely linked to the amenities that young people are likely to use. Therefore, we extracted amenity data from Open Street Map and used it to compute an amenity importance score. The goal is that sites with a larger score are assigned a larger number of e-scooter parks.

Amenity score

We select a subset of amenities that are likely to be heavily used by the target population - universities, schools, fast-food restaurants, ... - and assign an importance weight to each. We then compute the distance in km from the point of interest (POI) to the amenity and generate a score using:

$Score_{Amenity}^{POI}(Weight, Distance) = Weight \times f(Distance)$

For the amenity score of Model 1 - call this the standard amenity score - we have used $f(Distance) = \theta(Threshold - Distance)/(reg + Distance)^2$, where $\theta$ is the Heaviside step function, $reg$ is a regularization term, and $Threshold$ guarantees that only amenities sufficiently close to a POI are considered.

For the amenity score of Model 2, the points of interest are the stops. Here we want to penalize amenities that are too close to the stops, because a scooter park should sit near a bus stop that is itself not within walking distance of an important amenity. To that effect, we use $f(Distance) = \theta(Threshold - Distance)\,Burr(c, k)$, where $Burr(c, k)$ is the density of the Burr distribution.
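The two scoring functions above can be sketched as follows. The threshold, regularization, and Burr shape parameters below are illustrative placeholders, not the values used in our notebook:

```python
import numpy as np

def burr_pdf(x, c, k):
    """Burr (Type XII) density: c*k*x^(c-1) / (1 + x^c)^(k+1)."""
    x = np.asarray(x, dtype=float)
    return c * k * x ** (c - 1) / (1.0 + x ** c) ** (k + 1)

def standard_amenity_score(weight, distance_km, threshold=1.0, reg=0.05):
    """Model 1: weight * theta(threshold - d) / (reg + d)^2."""
    d = np.asarray(distance_km, dtype=float)
    return np.where(d < threshold, weight / (reg + d) ** 2, 0.0)

def stop_amenity_score(weight, distance_km, threshold=1.0, c=2.0, k=1.0):
    """Model 2: weight * theta(threshold - d) * Burr(c, k) density at d.
    The density vanishes at d = 0, penalizing amenities that sit
    right on top of the stop."""
    d = np.asarray(distance_km, dtype=float)
    return np.where(d < threshold, weight * burr_pdf(d, c, k), 0.0)
```

Note how the standard score grows as the amenity gets closer, while the Burr-based score peaks at an intermediate distance.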

Modeling

We propose two preliminary models:

Model 1 - Use weighted K-Means in a user-defined grid:

  1. Split Porto into a grid of points. For each grid point, compute the associated population using the Census data.
  2. Compute an amenity score for each point in this grid. In this particular case, we include bus/metro stops as amenities.
  3. Use weighted k-Means - with the number of clusters equal to the number of parks - to design a rough distribution of parks, to be adjusted later as per the constraints of the problem (dispersion of parks, geographical considerations, ...)
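The weighted k-Means step can be sketched in plain NumPy as below (in the notebook itself one would rather use scikit-learn's `KMeans` with `sample_weight`; all values here are toy data):

```python
import numpy as np

def weighted_kmeans(points, weights, n_clusters, n_iter=50, seed=0):
    """Lloyd iterations where centroids are weight-weighted means."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Initialize centroids on randomly chosen points
    centroids = points[rng.choice(len(points), n_clusters, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assign each grid point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the weighted mean of its cluster
        for j in range(n_clusters):
            mask = labels == j
            if weights[mask].sum() > 0:
                centroids[j] = np.average(points[mask], axis=0, weights=weights[mask])
    return centroids, labels
```

With a single cluster this reduces to the weighted mean of all points, which is a quick sanity check on the weighting.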

Model 2 - Use weighted K-Means on the bus/metro stops:

  1. Determine the level of usage of bus/metro stops.
  2. Compute a modified amenity score for each bus/metro stop.
  3. Apply a weighted clustering algorithm to the stops - with the number of clusters equal to the number of parks - where the weights are a combination of
    • Positive (reward) weights: Average number of rides, amenity score;
    • Negative (penalty) weights: Density of bus stops - if there is a large density of stops, it is possible that the transportation network is well-connected and there is less need for the scooters.

Results

To evaluate the models with respect to the current solution, we computed the standard amenity score for the current e-scooter parks and those created by Models 1 and 2. If our models are competitive, then their overall amenity scores should be larger than that of the current parks, meaning that they are better located.

Unfortunately, this is not the case, and both of our models underperform (with Model 2, which uses the bus stops as a reference, being slightly better). This is partially because the grid points and the bus stops cover a part of Porto (the southeast) with very few mapped amenities and no current scooter parks.

👥 Authors

💻 Development

Start coding here! 🐱‍🏍

Install and Imports

Importing the data

Extracting the GTFS data

Extracting population data for the city of Porto

Extracting amenity data from Open Street Map using the Overpass API

Joining the node and way data to form the amenities DataFrame.

EDA

Defining the map of Porto with amenities

Amenities dataset

We have to remove primary schools from the dataset, as the demographic that attends these schools is not old enough to ride the scooters. One could argue that we should remove the basic schools as well.

Bus GTFS dataset

Key conclusions: Determined the ride volume per stop and represented it on the map of Porto. This allows us to understand which areas of Porto are currently well served by public transport and which are not.

Importing the data

Agency

Stop times

This is a crucial piece of information to know how many buses pass at a certain stop per day.

Stop location

Because the scooter location is restricted to the Porto area, we shall focus on the Porto bus stops.

How many buses service a given stop throughout a business day?

Using stop times and stops

We define the average number of bus/metro rides using the rides occurring on working days:

First, we merge the stop_times_df and the relevant part of stops_df.

Define an hourly sampled time series - from 00:00 to 23:00 - of the buses that pass through a given stop, for all stops:
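A toy version of this computation, using hypothetical GTFS-style column names and a tiny hand-made `stop_times` fragment:

```python
import pandas as pd

# Toy stop_times fragment (GTFS-style columns; data is illustrative)
stop_times_df = pd.DataFrame({
    "stop_id": ["A", "A", "A", "B"],
    "arrival_time": ["07:15:00", "07:40:00", "08:05:00", "07:30:00"],
})

# Hour of day for each passing bus. GTFS times can exceed 24:00 for
# after-midnight trips; we ignore that edge case in this sketch.
hour = stop_times_df["arrival_time"].str.slice(0, 2).astype(int)

# Hourly sampled counts: buses per stop per hour of day
hourly = (stop_times_df.assign(hour=hour)
          .groupby(["stop_id", "hour"]).size()
          .unstack(fill_value=0))
```

Averaging `hourly` across the hours of a working day then gives the per-stop ride volume used as our passenger proxy.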

On average, each stop has about four buses passing by per hour. However, there are much busier locations (like Campo 24 de Agosto) which have over 20 buses per hour.

We now know the $n_{busiest}$ stops (in terms of the average number of buses per hour). On the map, they look something like:

Some examples are:

Using quantiles is quite insightful:

As expected, the center of Porto has the largest average number of rides - partially due to the fact that more lines cross the center of Porto. Key arteries of Porto like Avenida da Boavista also have a large number of rides (as represented by the yellow stars).

Metro GTFS dataset

Key conclusions: Determined the ride volume per stop and represented it on the map of Porto. This allows us to understand which areas of Porto are currently well served by public transport and which are not. Unfortunately, this dataset is missing ride data for the yellow line (line D). See the map below.

Importing the data

Agency

Stop times

This is a crucial piece of information to know how many metro trains pass at a certain stop per day.

Stop location

Because the scooter location is restricted to the Porto area, we shall focus on the Porto metro stops.

How many trams service a given stop throughout a business day?

Using stop times and stops.

Restricting to working-day rides

We want to apply this to all the stops in the Porto area. First, we merge the stop_times_df and the relevant part of stops_df.

Determining the busiest stops:

More details on the missing data hinted at above:

IPO does not really have only 4 trips per day - the data is missing.

E-Scooter OD

Key Conclusions:

  1. Generated a heatmap of scooter rides, which allows us to understand where users are riding the scooters;
  2. Checked the reasons why scooters are unavailable - riding out of jurisdiction appears to be a common problem;
  3. Computed how many scooters have been decommissioned since the program started. This is relevant for policy makers because, for scooters to be an alternative to traditional transportation, they have to be reliable;
  4. Computed the distribution of ride lengths to check our hypothesis. There are plenty of long rides, so some people are certainly using scooters as a replacement for bus/tram, which would refute our hypothesis. However, we are unsure whether using a constant average ride velocity for all trip lengths is reasonable - our expectation is that, as rides become longer (in time), users take more breaks, so their average speed decreases and, with it, the estimated traveled distance;
  5. Checked the start and end times of the rides.

Check what is the multiplicity of the columns:

Split scooters into working and non-working to get an idea of what percentage of the total fleet is working:
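As a sketch, assuming a status snapshot table with a `status` column (the schema and status labels here are hypothetical stand-ins for the actual dataset):

```python
import pandas as pd

# Toy status snapshots: one row per scooter state report (illustrative)
status_df = pd.DataFrame({
    "scooter_id": [1, 1, 2, 2, 3],
    "status": ["available", "running", "elsewhere", "available", "removed"],
})

# A scooter counts as "working" when it is available or in use
WORKING = {"available", "running"}
is_working = status_df["status"].isin(WORKING)
pct_working = 100 * is_working.mean()
```

The same boolean mask can also be grouped by `scooter_id` to get a per-scooter working fraction rather than a fleet-wide one.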

A scooter is either running or available only 47% of the time, which seems rather low. Perhaps their batteries run down and need frequent recharging? This is important to understand because scooters have some limitations as a means of transport - besides carrying only one or two riders, they are generally used only when the weather is good and the roads are dry.

Let us investigate why the scooters do not work:

"Elsewhere" and "removed" are the most common reasons.

There are/were 3720 different scooters in the system.

Only three percent of the total number of scooters have been decommissioned.

This, of course, is the total lifetime in hours. It would be interesting to see what the running time is.

Computing the working time of the scooters and with it, the working distance:

Let us assume that the longest ride takes 20 min ≈ 0.33 h, which would still correspond to around 7 km:
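The duration cap and the constant-speed distance estimate can be sketched as below; the average speed of 20 km/h is an illustrative assumption consistent with the "20 min ≈ 7 km" figure, not a measured value:

```python
import numpy as np

AVG_SPEED_KMH = 20.0     # assumed constant average riding speed (illustrative)
MAX_RIDE_H = 20 / 60     # cap ride duration at 20 minutes

ride_hours = np.array([0.05, 0.15, 0.50])    # toy ride durations in hours
capped_hours = np.minimum(ride_hours, MAX_RIDE_H)
distance_km = capped_hours * AVG_SPEED_KMH   # longest capped ride ~ 6.7 km
```

As noted in point 4 of the key conclusions, this constant-speed assumption likely overestimates the distance of long rides.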

Interestingly, it resembles a Burr distribution.

Check the dependence with start and end times

E-Scooter Parks

Key conclusions: where the current e-scooter parks are located.

Doing a quick clustering - a more educated guess for the number of clusters could be obtained using the silhouette score.
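The silhouette-based selection mentioned above could look like this sketch (using scikit-learn; the candidate range of cluster counts is arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 7), seed=0):
    """Return the candidate cluster count with the highest silhouette score."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)
```

For the park locations, `X` would be the array of (latitude, longitude) drop-off coordinates.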

Note that the scooters were often dropped by the company outside of the parks, which is rather strange.

Modeling

Model 1

  1. Split Porto into a grid of points. For each grid point, compute the associated population using the Census data.
  2. Compute an amenity score for each point in this grid. In this particular case, we include bus/metro stops as amenities.
  3. Use weighted k-Means - with the number of clusters equal to the number of parks - to design a rough distribution of parks, to be adjusted later as per the constraints of the problem (dispersion of parks, geographical considerations, ...)

Generating an equally spaced grid of points.
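A minimal sketch of the grid generation, using an approximate (illustrative) bounding box for Porto rather than the exact one from the notebook:

```python
import numpy as np

# Approximate Porto bounding box (illustrative coordinates)
LAT_MIN, LAT_MAX = 41.14, 41.19
LON_MIN, LON_MAX = -8.69, -8.55

def make_grid(n_lat=20, n_lon=30):
    """Equally spaced (lat, lon) grid over the bounding box."""
    lats = np.linspace(LAT_MIN, LAT_MAX, n_lat)
    lons = np.linspace(LON_MIN, LON_MAX, n_lon)
    return np.array([(la, lo) for la in lats for lo in lons])
```

A rectangular box inevitably includes points over the Douro river, hence the clean-up step noted below.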

Note that some points are in the middle of the river, which would need to be solved.

Initial clustering

Comparing the amenity score between the new scooter parks and the old ones:

The original parks perform better - regarding our amenities score - than the new parks.

Model 2

  1. Determine the level of usage of bus/metro stops.
  2. Compute a modified amenity score for each bus/metro stop.
  3. Apply a weighted clustering algorithm to the stops - with the number of clusters equal to the number of parks - where the weights are a combination of
    • Positive (reward) weights: Average number of rides, amenity score;
    • Negative (penalty) weights: Density of bus stops - if there is a large density of stops, it is possible that the transportation network is well-connected and there is less need for the scooters.

Determining stop density - penalty

Let us normalize the three weights for convenience:
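A sketch of the normalization and combination of the three weights (the per-stop values are toy data, and clipping the combined weight at zero is our illustrative choice to keep the clustering weights nonnegative):

```python
import numpy as np

def minmax(x):
    """Min-max normalize to [0, 1]; constant arrays map to zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

# Toy per-stop quantities (illustrative values)
ride_counts    = np.array([120.0, 40.0, 80.0])   # reward
amenity_scores = np.array([0.9, 0.1, 0.5])       # reward
stop_density   = np.array([5.0, 1.0, 3.0])       # penalty

combined = minmax(ride_counts) + minmax(amenity_scores) - minmax(stop_density)
weights = np.clip(combined, 0.0, None)           # keep weights nonnegative
```

These `weights` then feed into the weighted clustering of the stops as the per-stop sample weights.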

Although Model 2 performs better than Model 1, it still underperforms the current park layout.

🖼️ Visualisations

Copy here the most important visualizations (graphs, charts, maps, images, etc). You can refer to them in the Executive Summary.

Technical note: If not all the visualisations are visible, you can still include them as an image or link - in this case please upload them to your own repository.

Image_0_Validations_per_day.png

Image_1_Amenities.png

Image_2_Time_Series_StAnthony_Hospital.png

Image_3_BusStops_Busiest.png

Image_4_Histogram_Scooter_Ride_Length.png

Image_5_Histogram_Scooter_Start_End.png

Image_6_Amenity_Quantile_Score_Grid.png

Image_7_Model_1_Results.png

Image_8_Model_2_Results.png

👓 References

List all of the external links (even if they are already linked above), such as external datasets, papers, blog posts, code repositories and any other materials.

Census data - Portuguese National Institute of Statistics

Amenity data from Open Street Map

Burr distribution